DOC GH17505 Added some links and examples. To little/much/wrong? #17908


Closed
wants to merge 1 commit into from

Conversation

@linebp (Contributor) commented Oct 17, 2017

Before I spend more time on this, I'd like to know if I am doing too much, too little or just plain all wrong.

@codecov bot commented Oct 17, 2017

Codecov Report

Merging #17908 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17908      +/-   ##
==========================================
- Coverage   91.23%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       50105    50105              
==========================================
- Hits        45715    45706       -9     
- Misses       4390     4399       +9
Flag        Coverage Δ
#multiple   89.03% <ø> (ø) ⬆️
#single     40.31% <ø> (-0.06%) ⬇️

Impacted Files         Coverage Δ
pandas/io/gbq.py       25% <0%> (-58.34%) ⬇️
pandas/core/frame.py   97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bf7f9a...ccfa848. Read the comment docs.


df.groupby(['A', 'B']).sum().reset_index()
``count``, Number of non-NA observations
Contributor

these can be links as well

Contributor Author

The links to the functions available with aggregate? I think that would be a great idea. Where can I find the list of available functions and the shortcuts? I figured that must have been documented elsewhere at some point, but I couldn't find it.

@jreback (Contributor) commented Oct 17, 2017

Can you post a rendered version of this page? Since you are doing lots of changes, it's hard to see what the new version would look like.

@linebp (Contributor, Author) commented Oct 19, 2017


Group By: split-apply-combine


Split-apply-combine is a common paradigm in data analysis. It involves splitting
the data set into smaller groups, applying some operation to each group
independently and combining the results into a data structure. This strategy is
supported, for example, by Excel's pivot tables, SQL's GROUP BY operator and R's
plyr package. This section will look at the Pandas ``groupby`` and
related functions and show you how to do split-apply-combine in Pandas. See the
:ref:`cookbook <cookbook.grouping>` for some advanced strategies.

The split step is the most straightforward. See the section on
:ref:`splitting <groupby.split>` below.
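For illustration, a minimal sketch of the split step on a made-up DataFrame (the names here are purely illustrative):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'y', 'x', 'y'], 'B': [1, 2, 3, 4]})
    grouped = df.groupby('A')   # a GroupBy object; this only describes the splitting, nothing is computed yet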

In the apply step you may wish to apply one of the following operations:

  • Aggregate: Get a single value for each group. This could be a summary
    statistic like the sum or mean of some column, or a count of the number of members
    in the group. See the section on :ref:`aggregating <groupby.aggregate>` below.
  • Filter: Keep a subset of your original data. Discard data
    according to some function applied to each group. This can be useful when,
    for example, you wish to discard groups with a low member count. See the section on
    :ref:`filtering <groupby.filter>` below.
  • Transform: Compute a new value for each original row. This can be used to
    normalize/scale data or to fill in erroneous or missing values. See the
    section on :ref:`transforming <groupby.transform>` below.

Pandas has direct support for these three operations and will try to return a
sensibly combined result; a short sketch of all three follows below. See
`when to use aggregate/filter/transform in Pandas <https://pythonforbiologists.com/when-to-use-aggregatefiltertransform-in-pandas/>`_
for further help.
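A minimal sketch of the three operations on a small made-up DataFrame (``df`` here is a toy example, not a real dataset):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                       'B': [1, 2, 3, 4, 5]})

    df.groupby('A').aggregate('sum')                     # aggregate: one value per group
    df.groupby('A').filter(lambda g: len(g) > 2)         # filter: keep only groups with more than two rows
    df.groupby('A').transform(lambda g: g - g.mean())    # transform: one value per original row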

Pandas also supports iteration over the groups created in the split step. Using
iteration over the groups (rather than the three shortcut functions) gives
more control over the apply and combine parts of the process, but also requires
more work from the programmer. See the section on
:ref:`iterating <groupby.iterating>` below.
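A minimal sketch of doing the apply and combine steps by hand through iteration (again on a made-up DataFrame):

    import pandas as pd

    df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3]})

    pieces = []
    for name, group in df.groupby('A'):   # each iteration yields the group key and its sub-DataFrame
        pieces.append(group['B'] * 2)     # the "apply" step, done by hand
    result = pd.concat(pieces)            # the "combine" step, also by hand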

@linebp (Contributor, Author) commented Oct 19, 2017

Aggregation

This section describes how to aggregate data. We will be giving examples using the ``tips.csv`` dataset. Each row represents a meal at some restaurant; the columns store the value of the total bill, the size of the tip and some metadata about the customer.

In [54]: tips = pd.read_csv('./data/tips.csv')

In [55]: tips
Out[55]: 
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
5         25.29  4.71    Male     No   Sun  Dinner     4
6          8.77  2.00    Male     No   Sun  Dinner     2
..          ...   ...     ...    ...   ...     ...   ...
237       32.83  1.17    Male    Yes   Sat  Dinner     2
238       35.83  4.67  Female     No   Sat  Dinner     3
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]

What if we wanted to know the average total bill on each day? We split the data so that each group consists of all the meals eaten on the same day. We want a single value for each group, so we should use the aggregate function:

In [56]: tips.groupby('day').aggregate('mean')
Out[56]: 
      total_bill       tip      size
day
Fri    17.151579  2.734737  2.105263
Sat    20.441379  2.993103  2.517241
Sun    21.410000  3.255132  2.842105
Thur   17.682742  2.771452  2.451613

The result has the group names, in this case the days, as the index along the grouped axis. Along the other axis we have the columns for which Pandas could calculate a mean, i.e. the ones with a numeric data type. We could have selected the ``total_bill`` column either before or after aggregating to limit the result to this column, but not before splitting, since we need the ``day`` column for the splitting.
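For instance, a minimal sketch of both options using the ``tips`` DataFrame from above (output omitted):

    tips.groupby('day')['total_bill'].mean()   # select the column first, then aggregate
    tips.groupby('day').mean()['total_bill']   # aggregate everything, then select (newer pandas may need mean(numeric_only=True))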

How about the number of guests for each day and for each time of day? In this case it is not enough to split the data on the day it was eaten; we also need to split by the time of day. Instead of calculating the mean, as in the previous example, we use the ``sum`` function.

In [57]: tips.groupby(['day', 'time'])['size'].agg('sum')
Out[57]: 
day   time
Fri   Dinner     26
      Lunch      14
Sat   Dinner    219
Sun   Dinner    216
Thur  Dinner      2
      Lunch     150
Name: size, dtype: int64

``agg`` is short for ``aggregate``.
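So the previous aggregation could equally be spelled with the long name; a minimal sketch (output omitted):

    tips.groupby(['day', 'time'])['size'].aggregate('sum')   # identical to .agg('sum') above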

Pandas has support for a number of basic descriptive statistics functions which can be used with aggregate (a short example follows the list):

``count``, Number of non-NA observations
``sum``, Sum of values
``mean``, Mean of values
``mad``, Mean absolute deviation
``median``, Arithmetic median of values
``min``, Minimum
``max``, Maximum
``mode``, Mode
``abs``, Absolute value
``prod``, Product of values
``std``, Bessel-corrected sample standard deviation
``var``, Unbiased variance
``sem``, Standard error of the mean
``skew``, Sample skewness (3rd moment)
``kurt``, Sample kurtosis (4th moment)
``quantile``, Sample quantile (value at %)
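A minimal sketch passing a few of these names to ``agg``, again on the ``tips`` data (output omitted):

    tips.groupby('day')['tip'].agg('mean')                    # a single function, by name
    tips.groupby('day')['tip'].agg(['count', 'mean', 'std'])  # several at once; one result column per function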

What if we need to know the difference between the smallest and largest total bill for each party size? Again we split the data, this time so that each group holds the meals eaten by parties of the same size. But which function do we use to find the difference? The ``agg`` function also accepts a function as argument. The function is called with the values of each column for each group and should return a single value.

In [58]: tips.groupby(['size']).agg(lambda group: max(group) - min(group))['total_bill']
Out[58]: 
size
1     7.00
2    34.80
3    40.48
4    31.84
5    20.50
6    21.12
Name: total_bill, dtype: float64

@linebp (Contributor, Author) commented Oct 19, 2017

The formatting is not great, but I hope it gives a better idea of what's been changed and how it would look.
I pasted the two sections I rewrote above; it should be obvious where they belong in the original?

Is there a better way to do this?

@jreback (Contributor) commented Nov 23, 2017

@linebp can you post a rendered screenshot of this?

@jreback (Contributor) commented Jan 21, 2018

can you rebase and show a rendered screenshot?

@jreback (Contributor) commented Feb 24, 2018

@linebp sorry, we let this get away from us. Happy to have some clarifications, but can you do it in a more targeted manner? IOW, more PRs with smaller changes is usually better.

@jreback closed this Feb 24, 2018
@linebp (Contributor, Author) commented Feb 28, 2018

I'll have a look at it again and see if I can do a PR with a smaller change and do a screenshot of the rendered changes.

Development

Successfully merging this pull request may close these issues.

DOC: nice links / examples for setting with copy & aggregation